69 research outputs found

    RLZAP: Relative Lempel-Ziv with Adaptive Pointers

    Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression because many of the differences between individuals' genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.'s idea to also handle short insertions, deletions and multi-character substitutions well. We show experimentally that our generalization achieves better compression than Ferrada et al.'s implementation with comparable random-access times.
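The greedy parsing with mismatch characters described in this abstract can be illustrated with a minimal Python sketch. It is not any of the cited implementations: the function names are hypothetical, the longest-match search is a naive quadratic scan (a practical tool would use an index over the reference), and the relative pointer is assumed here to be the reference position minus the position in the parsed text.

```python
def rlz_parse(reference: str, target: str):
    """Greedily parse `target` into (ref_pos, length, mismatch_char) phrases."""
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = 0, 0
        # Naive search for the longest prefix of target[i:] occurring in reference.
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        j = i + best_len
        # Each phrase ends with one explicit mismatch character (empty at the end).
        mismatch = target[j] if j < len(target) else ""
        phrases.append((best_pos, best_len, mismatch))
        i = j + 1
    return phrases


def rlz_decode(reference: str, phrases) -> str:
    """Rebuild the target from the reference and the phrase list."""
    return "".join(reference[p:p + n] + c for p, n, c in phrases)


def relative_pointers(phrases):
    """Assumed definition: reference position minus current text position."""
    rel, text_pos = [], 0
    for ref_pos, length, mismatch in phrases:
        rel.append(ref_pos - text_pos)
        text_pos += length + len(mismatch)
    return rel


reference = "ACGTACGTTGCA"
target = "ACGAACGTTGCA"  # one substitution relative to the reference
phrases = rlz_parse(reference, target)
print(phrases)
print(relative_pointers(phrases))
```

In this toy input the two phrases sit on either side of a single substitution, so their relative pointers are equal, which is the kind of run that run-length compression can exploit.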

    Even faster sorting of (not only) integers

    In this paper we introduce RADULS2, the fastest parallel sorter based on the radix algorithm. It is optimized to process huge amounts of data, making use of modern multicore CPUs. The main novelties include an extremely optimized algorithm for handling tiny arrays (up to about a hundred records), which can appear billions of times as subproblems, and improved processing of larger subarrays with better use of non-temporal memory stores.
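As a rough illustration of why tiny subarrays matter in a radix sorter, the following Python sketch shows a most-significant-digit radix sort that hands small buckets to insertion sort. It is not RADULS2: the threshold `TINY`, the function names, and the choice of insertion sort are assumptions made for the example, and it is sequential and cannot express non-temporal memory stores.

```python
TINY = 100  # assumed cutoff, loosely following "up to about a hundred records"


def insertion_sort(items):
    """Sort a small list in place and return it."""
    for i in range(1, len(items)):
        x = items[i]
        j = i - 1
        while j >= 0 and items[j] > x:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = x
    return items


def msd_radix_sort(keys, byte_index=7):
    """Sort non-negative integers below 2**64, one byte per recursion level."""
    if byte_index < 0:
        return list(keys)  # all eight bytes consumed, so the keys are equal
    if len(keys) <= TINY:
        return insertion_sort(list(keys))  # tiny subproblem: skip bucketing
    buckets = [[] for _ in range(256)]
    shift = 8 * byte_index
    for k in keys:
        buckets[(k >> shift) & 0xFF].append(k)
    out = []
    for bucket in buckets:
        if bucket:
            out.extend(msd_radix_sort(bucket, byte_index - 1))
    return out
```

Each recursion level splits the input into up to 256 buckets, so the number of small subproblems grows quickly with depth; this is the situation in which a specialized routine for tiny arrays pays off.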

    Improving Transmission Efficiency of Large Sequence Alignment/Map (SAM) Files

    Research in bioinformatics primarily involves the collection and analysis of large volumes of genomic data. Naturally, it demands efficient storage and transfer of this data. In recent years, some research has been done to find efficient compression algorithms that reduce the size of various kinds of sequencing data. One way to improve the transmission time of large files is to apply maximal lossless compression to them. In this paper, we present SAMZIP, a specialized encoding scheme for sequence alignment data in SAM (Sequence Alignment/Map) format, which improves the compression ratio of existing compression tools. To achieve this, we exploit prior knowledge of the file format and its specification. Our experimental results show that our encoding scheme improves the compression ratio, thereby significantly reducing overall transmission time.

    Predicting the deleterious effects of mutation load in fragmented populations.

    Human-induced habitat fragmentation constitutes a major threat to biodiversity. Both genetic and demographic factors combine to drive small and isolated populations into extinction vortices. Nevertheless, the deleterious effects of inbreeding and drift load may depend on population structure, migration patterns, and mating systems and are difficult to predict in the absence of crossing experiments. We performed stochastic individual-based simulations aimed at predicting the effects of deleterious mutations on population fitness (offspring viability and median time to extinction) under a variety of settings (landscape configurations, migration models, and mating systems) on the basis of easy-to-collect demographic and genetic information. Pooling all simulations, a large part (70%) of variance in offspring viability was explained by a combination of genetic structure (F(ST)) and within-deme heterozygosity (H(S)). A similar part of variance in median time to extinction was explained by a combination of local population size (N) and heterozygosity (H(S)). In both cases the predictive power increased above 80% when information on mating systems was available. These results provide robust predictive models to evaluate the viability prospects of fragmented populations.

    Relative Lempel-Ziv Compression of Suffix Arrays

    We show that a combination of differential encoding, random sampling, and relative Lempel-Ziv (RLZ) parsing is effective for compressing suffix arrays, while simultaneously allowing very fast decompression of arbitrary suffix array intervals, facilitating pattern matching. The resulting text index, while somewhat larger (5-10x) than the recent r-index of Gagie, Navarro, and Prezza (Proc. SODA ’18), still provides significant compression and allows pattern location queries to be answered more than two orders of magnitude faster in practice.
    Peer reviewed
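To make two of the ingredients named in this abstract concrete, here is a minimal Python sketch of a differentially encoded suffix array with sampled absolute values, from which an arbitrary interval can be decoded. It is only an illustration under stated assumptions: the function names are hypothetical, the samples are taken at a regular rate rather than randomly, the suffix array is built naively, and the RLZ parsing of the differences is left out entirely.

```python
def suffix_array(text: str):
    """Naive suffix array construction (fine for small examples only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def encode(sa, sample_rate: int):
    """Keep every sample_rate-th value plus all consecutive differences."""
    diffs = [sa[i] - sa[i - 1] for i in range(1, len(sa))]
    samples = sa[::sample_rate]
    return samples, diffs


def decode_interval(samples, diffs, sample_rate: int, lo: int, hi: int):
    """Return the suffix array values at positions lo..hi-1."""
    if lo >= hi:
        return []
    block = lo // sample_rate
    idx = block * sample_rate  # nearest sampled position at or before lo
    value = samples[block]
    out = []
    while idx < hi:
        if idx >= lo:
            out.append(value)
        idx += 1
        if idx < hi:
            value += diffs[idx - 1]  # diffs[k] holds sa[k + 1] - sa[k]
    return out


sa = suffix_array("banana")
samples, diffs = encode(sa, sample_rate=2)
print(decode_interval(samples, diffs, 2, 1, 5) == sa[1:5])
```

Decoding starts from the nearest sample and adds differences, so the cost of extracting an interval is bounded by the sample rate plus the interval length rather than by the size of the whole array.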

    State-of-the-Art in Weighted Finite-State Spell-Checking

    Proceeding volume: 2
    The following claims can be made about finite-state methods for spell-checking: 1) Finite-state language models provide support for morphologically complex languages that word lists, affix stripping and similar approaches do not provide; 2) Weighted finite-state models have expressive power equal to other, state-of-the-art string algorithms used by contemporary spell-checkers; and 3) Finite-state models are at least as fast as other string algorithms for lookup and error correction. In this article, we use some contemporary non-finite-state spell-checking methods as a baseline and perform tests in light of the claims, to evaluate state-of-the-art finite-state spell-checking methods. We verify that finite-state spell-checking systems outperform the traditional approaches for English. We also show that the models for morphologically complex languages can be made to perform on par with English systems.
    Peer reviewed

    Sequencing and de novo assembly of 150 genomes from Denmark as a population reference

    Hundreds of thousands of human genomes are now being sequenced to characterize genetic variation and use this information to augment association mapping studies of complex disorders and other phenotypic traits. Genetic variation is identified mainly by mapping short reads to the reference genome or by performing local assembly. However, these approaches are biased against discovery of structural variants and variation in the more complex parts of the genome. Hence, large-scale de novo assembly is needed. Here we show that it is possible to construct excellent de novo assemblies from high-coverage sequencing with mate-pair libraries extending up to 20 kilobases. We report de novo assemblies of 150 individuals (50 trios) from the GenomeDenmark project. The quality of these assemblies is similar to that of assemblies obtained using the more expensive long-read technology. We use the assemblies to identify a rich set of structural variants including many novel insertions and demonstrate how this variant catalogue enables further deciphering of known association mapping signals. We leverage the assemblies to provide 100 completely resolved major histocompatibility complex haplotypes and to resolve major parts of the Y chromosome. Our study provides a regional reference genome that we expect will improve the power of future association mapping studies and hence pave the way for precision medicine initiatives, which are now being launched in many countries including Denmark.